Question 1

Question 2

To analyze relationships between CPI (consumer price index) and unemployment rates over time, the variables we will keep are the CPI variable (cpi), the unemployment rate variable (unemployment_rates), and one variable each for year and month (year and month). After using the annual average column (annual) to fill in the missing monthly unemployment data for September 1953, as noted below, I discarded that variable, and I likewise discarded the half1_half2 variable. However, if I wanted to analyze annual averages over time, I would have kept both of those variables and wrangled the half1_half2 variable into an annual CPI average to contrast with the yearly unemployment rate average.
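The selection step can be sketched as follows. The column names (cpi, unemployment_rates, year, month, annual, half1_half2) come from the text; the data frame name and values are hypothetical placeholders:

```r
library(dplyr)
library(tibble)

# Hypothetical merged data frame with the columns described above
df <- tibble(
  year = 1948, month = 1:3,
  cpi = c(23.7, 23.5, 23.4),
  unemployment_rates = c(3.4, 3.8, 4.0),
  annual = 3.8, half1_half2 = "half1"
)

# Keep only the four analysis variables; annual and half1_half2 are
# dropped after they have been used to fill in missing values
tidy_df <- df %>%
  select(year, month, cpi, unemployment_rates)
```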

I found several peculiarities through multiple sanity checks. In table 1, the row for 1947 was not located sequentially (before 1948, with the rest of the years in ascending order) but instead sat between 1967 and 1968, so it was not immediately clear from a quick visual scan that the data began in 1947. Table 1 was also missing data for the year 1984.
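A sanity check of this kind can be automated rather than done by eye. Below is a sketch with a hypothetical year coverage (1947 out of order, 1984 missing); the variable names are illustrative:

```r
library(dplyr)
library(tibble)

# Hypothetical CPI table with 1947 stored out of sequence and 1984 absent
table1 <- tibble(year = c(1948:1983, 1985, 1947))

# Re-sort so years run in ascending order
table1 <- arrange(table1, year)

# Flag any years missing from the observed range
setdiff(seq(min(table1$year), max(table1$year)), table1$year)  # -> 1984
```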

In table 2, there is a negative value for April 1977, and the September 1953 value is missing. The negative value is peculiar because a negative unemployment rate is nonsensical. Both problems could be handled the same way: since we have the data from the other 11 months and the annual average for each of those years, the missing or erroneous value can be calculated directly.
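The recovery step follows from the definition of a mean. Assuming the annual figure is the simple mean of the 12 monthly rates, a single unknown month can be backed out as 12 × annual − (sum of the other 11 months). A minimal sketch with illustrative numbers:

```r
# Recover one missing month from the annual mean and the 11 known months,
# assuming annual = mean of all 12 months
recover_month <- function(annual, other_months) {
  stopifnot(length(other_months) == 11)
  12 * annual - sum(other_months)
}

# Hypothetical numbers: 11 known months of 2.9 and an annual mean of 2.9
recover_month(2.9, rep(2.9, 11))  # ~ 2.9
```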

Additionally, all the values for 1961 are 0.0. Based on the rest of the data, it seems unlikely that these are the true values for 1961. It might be reasonable to treat them as missing data, but I decided to keep them for now, prior to doing any analysis. Depending on the analysis I choose later, I may treat this year as missing data and drop it.
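If the zeros are later judged to be missing data, recoding them to NA is a one-line change. A sketch with hypothetical data:

```r
library(dplyr)
library(tibble)

# Hypothetical tidy frame with 1961's suspicious 0.0 values
df <- tibble(year = c(1960, 1961), month = 1,
             unemployment_rates = c(5.5, 0.0))

# Recode 1961 to NA (alternatively, filter(year != 1961) to drop it)
cleaned <- df %>%
  mutate(unemployment_rates = if_else(year == 1961, NA_real_,
                                      unemployment_rates))
```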

Another peculiarity I noticed is that in table2.xlsx, the data for each month is given to one decimal place, but the data under the “Annual” column is given as integers. Assuming that the “Annual” column in table 2 indicates the annual average unemployment rate for that year, it is important to adjust for this rounding. For example, for 1948 in table2.xlsx, the average unemployment rate calculated directly from the 12 months of data is 3.8, but the Annual value is given as 4. Interestingly, once the .xlsx file was read into the .rmd file with the read_xlsx() function, the values in the Annual column became doubles given to one decimal place, rather than integers as displayed in the .xlsx file.
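The rounding mismatch can be demonstrated directly. The monthly values below are illustrative numbers whose mean recomputes to 3.8, matching the figure reported in the text for 1948:

```r
# Illustrative monthly rates whose mean is 3.75, displayed as 3.8
months_1948 <- c(3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.8, 4.0)
round(mean(months_1948), 1)       # recomputed annual mean: 3.8

# The spreadsheet displays the Annual cell as the integer 4, so a naive
# equality check against the displayed value flags a spurious mismatch
round(mean(months_1948), 1) == 4  # -> FALSE
```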

To handle NA values: since table 2 had no data for 1947 and table 1 had no data for 1984, I deleted the 1947 row from table 1 and the 1984 row from table 2. This ensures the final tidy dataframe contains no years with data only about CPI and nothing about unemployment rates, or vice versa; for any comparison of CPI and unemployment rates, those years would be useless.
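Rather than deleting rows by hand, an inner join restricts the result to years present in both tables, dropping 1947 and 1984 automatically. A sketch with hypothetical year coverage and values:

```r
library(dplyr)
library(tibble)

# Hypothetical coverage: the CPI table lacks 1984, the unemployment table lacks 1947
cpi_tab   <- tibble(year = 1947:1983, cpi = 100)
unemp_tab <- tibble(year = 1948:1984, unemployment_rates = 5)

# Keep only years that appear in both tables
both <- inner_join(cpi_tab, unemp_tab, by = "year")
range(both$year)  # -> 1948 1983
```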

Question 3

Question 4

For comparison: head of original anscombe dataset
x1 x2 x3 x4 y1 y2 y3 y4
10 10 10 8 8.04 9.14 7.46 6.58
8 8 8 8 6.95 8.14 6.77 5.76
13 13 13 8 7.58 8.74 12.74 7.71
9 9 9 8 8.81 8.77 7.11 8.84
11 11 11 8 8.33 9.26 7.81 8.47
14 14 14 8 9.96 8.10 8.84 7.04
For comparison: head of tidy anscombe dataset
dataset x y
1 10 8.04
1 8 6.95
1 13 7.58
1 9 8.81
1 11 8.33
1 14 9.96

To tidy the anscombe dataset, I first used the pivot_longer() function to pivot all columns, producing a column called variable holding the x_i or y_i character strings and a column called value holding the value associated with each x_i or y_i. I arranged the rows in ascending order so that all of the x_1 values were listed first, then the x_2 values, and so on. Next I created a new column called y_value that duplicated the value column; this column would eventually end up storing the y values. I then used the shift.column() function from the useful package to shift all the values associated with the y_i rows up 44 rows, aligning each y value with the row of the x_i it should be paired with. This meant that for each (x_i, y_i) observation, the associated x and y values were stored in the same row, each in its own column. Then I renamed some variables for clarity. Finally, I created a new variable dataset recording whether a pair was part of dataset 1, 2, 3, or 4, based on whether it came from the (x_1, y_1), (x_2, y_2), (x_3, y_3), or (x_4, y_4) value pair respectively.
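As an aside, the same result can be reached in a single pivot, without shift.column(): tidyr's ".value" sentinel splits each column name like "x1" into a target column ("x" or "y") and a dataset label ("1" through "4"). This is a different technique from the shift-based approach described above, shown here only as an alternative sketch:

```r
library(tidyr)
library(dplyr)

# One-step tidy of anscombe: ".value" sends the x/y part of each name
# to its own column, and the digit becomes the dataset label
tidy_anscombe <- anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "dataset"),
               names_pattern = "(.)(.)") %>%
  arrange(dataset)

head(tidy_anscombe)
```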